Staff Engineer - NKP
Hungry, Humble, Honest, with Heart.
The Opportunity
At Nutanix, we are expanding the capabilities of the Nutanix Kubernetes Platform (NKP) to power the next generation of enterprise AI. We are building a comprehensive, Kubernetes-native AI Platform-as-a-Service (PaaS) layer designed to run seamlessly across hybrid, edge, and multi-cloud environments.
As a Staff Engineer on the NKP AI Platform team, you will be a key technical leader responsible for architecting, building, and scaling the infrastructure that allows enterprises to train, deploy, and manage traditional ML and Large Language Models (LLMs) effortlessly.
This is a high-impact, hands-on engineering role. You will bridge the gap between underlying container orchestration and advanced AI/ML workloads. You will be tackling hard problems in distributed systems, GPU scheduling, model serving optimization, and MLOps automation, ensuring our platform is resilient, secure, and highly performant.
About the Team
NKP is a complete, enterprise-grade, production-ready platform based on Kubernetes that serves modern applications. By taking the best of the Cloud Native Computing Foundation (CNCF) ecosystem and providing it in an integrated, lifecycle-managed package, NKP delivers all the functionality necessary in production with a dramatically lowered TCO. NKP provides simplified Kubernetes operations, seamless application mobility, and integrated security and compliance. Platform engineers and IT admins can use NKP to keep all cloud native resources up to date with upgrades, patching, maintenance, and security. The technology and community in this industry are growing fast: new applications and solutions come to market every day, and new standards are maturing rapidly.
You will report to the Senior Engineering Manager, who adopts a collaborative leadership style that encourages team involvement in decision-making and values open discussions. The manager emphasizes the importance of strong communication and problem-solving skills, aiming to build a supportive environment for both personal and professional development.
This role does not require any travel, allowing the new hire to focus on their responsibilities within the local team while also collaborating effectively with their US counterparts through virtual meetings and communication tools.
Your Role
- Architect the AI PaaS Layer: Lead the design and implementation of NKP's AI platform layer, building cohesive abstractions for serverless model training, hyperparameter tuning, and model serving.
- Extend Kubernetes for AI: Develop Kubernetes Custom Resource Definitions (CRDs), Operators, and Controllers in Golang to natively orchestrate AI workloads.
- Optimize GPU & Hardware Utilization: Build scheduling intelligence to optimize the use of NVIDIA GPUs (utilizing MIG, Time-Slicing, etc.) and AI accelerators for distributed training and inference.
- Integrate the MLOps Ecosystem: Seamlessly integrate and manage the lifecycle of leading open-source AI tools like Kubeflow, KServe, Ray, vLLM, and Triton Inference Server within the NKP stack.
- Build the Inference Engine: Design high-throughput, low-latency model serving infrastructure for LLMs, including capabilities like continuous batching, KV-cache optimization, and dynamic auto-scaling.
- Technical Leadership & Mentorship: Act as a force multiplier for the team. Drive architectural reviews, establish coding standards, and mentor junior and mid-level engineers.
- Cross-Functional Collaboration: Partner closely with Product Managers, the core NKP Kubernetes team, and Nutanix Data Services (NDK) to ensure seamless data gravity and persistent storage for AI workloads.
What You Will Bring
- Experience: 8+ years of software engineering experience, with at least 3 years focused on building scalable distributed systems, cloud platforms, or AI/ML infrastructure.
- Language Proficiency: Deep expertise in Golang (for K8s ecosystem) and Python (for ML ecosystem).
- Kubernetes Mastery: Extensive experience with Kubernetes internals, building K8s Operators, Controllers, and working with the Cluster API (CAPI).
- AI/ML Infrastructure Knowledge: Hands-on experience with MLOps frameworks and distributed AI computing (e.g., Kubeflow, Ray, KServe, PyTorch DDP).
- Systems Engineering: Strong understanding of Linux internals, networking, and container runtimes (containerd, Docker).
- API & Platform Design: Proven track record of designing intuitive, developer-friendly APIs and scalable microservices architectures.
- Education: Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or a related field (or equivalent practical experience).
Nice to Haves:
- Contributions to CNCF projects (Kubernetes, Helm, Istio) or AI open-source projects (vLLM, Ray, Kubeflow).
- Familiarity with vector databases (Milvus, Qdrant, PGVector) and RAG (Retrieval-Augmented Generation) architectures.
- Experience with advanced observability stacks (Prometheus, Grafana, OpenTelemetry) tailored for AI metrics (e.g., tracking GPU utilization, token generation rates, model drift).
Work Arrangement
Hybrid: This role operates in a hybrid capacity, blending the benefits of remote work with the advantages of in-person collaboration. In locations where our workplace policy applies (i.e. San Jose, Durham, Mexico City, Bangalore, Pune, Hoofddorp, Belgrade, Barcelona, Singapore, Sydney and Tokyo), employees are expected to work onsite a minimum of 3 days per week to foster collaboration, team alignment, and access to in-office resources. Workplace type may vary based on location and team requirements. Please speak with your recruiter for details. Additional team-specific guidance and norms will be provided by your manager.
The pay range for this position at commencement of employment is expected to be between CAD $208,000 and CAD $313,200 per year.
However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements, including a sign-on bonus, restricted stock units, and discretionary awards in addition to a full range of medical, financial, and/or other benefits (including various paid time off benefits, such as vacation, sick time, and parental leave), dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.
If hired, the employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors. Our application deadline is 40 days from the date of posting. In good faith, the posting may be removed prior to this date if the position is filled, or it may be extended.
--
Nutanix is an Equal Employment Opportunity and (in the U.S.) an Affirmative Action employer. Qualified applicants are considered for employment opportunities without regard to race, color, religion, sex, sexual orientation, gender identity or expression, national origin, age, marital status, protected veteran status, disability status or any other category protected by applicable law. We hire and promote individuals solely on the basis of qualifications for the job to be filled. We strive to foster an inclusive working environment that enables all our Nutants to be themselves and to do great work in a safe and welcoming environment, free of unlawful discrimination, intimidation or harassment. As part of this commitment, we will ensure that persons with disabilities are provided reasonable accommodations. If you need a reasonable accommodation, please let us know by contacting [email protected].